In [1]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set_theme(color_codes=True)


In [2]: df = pd.read_csv('pizza_v1.csv')
df.head()

Out[2]: [Table: first five rows of df. Columns: company, price_rupiah, diameter, topping, variant, size, extra_sauce, extra_cheese.]
Data Preprocessing Part 1 

In [3]: # remove "Rp" and comma from "price_rupiah" column
df['price_rupiah'] = df['price_rupiah'].str.replace('Rp', '').str.replace(',', '')


In [4]: df.head()

Out[4]: [Table: first five rows of df after cleaning; price_rupiah now contains digits only. Columns: company, price_rupiah, diameter, topping, variant, size, extra_sauce, extra_cheese.]


In [5]: # Check the number of unique values in each object-dtype column
df.select_dtypes(include='object').nunique()

Out[5]: company          5
price_rupiah    43
topping         12
variant         20
size             6
extra_sauce      2
extra_cheese     2
dtype: int64


In [6]: # convert "price_rupiah" column to integer
df['price_rupiah'] = df['price_rupiah'].astype(int)
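The two preprocessing steps above (stripping the currency formatting, then casting to integer) can be sketched as one small helper. This is only an illustration: `clean_price` is a hypothetical name, and the sample prices are made up rather than taken from the dataset.

```python
import pandas as pd

# Made-up sample prices in the raw "Rp...,..." style (illustrative only).
raw = pd.Series(["Rp235,000", "Rp198,000", "Rp72,000"], name="price_rupiah")

def clean_price(prices: pd.Series) -> pd.Series:
    """Strip the 'Rp' prefix and thousands separators, then cast to int."""
    # regex=False treats both patterns as literal text, not regular expressions.
    stripped = (
        prices.str.replace("Rp", "", regex=False)
              .str.replace(",", "", regex=False)
    )
    return stripped.astype(int)

cleaned = clean_price(raw)
print(cleaned.tolist())
```

Passing `regex=False` makes the intent explicit and keeps the behavior the same across pandas versions, since the default for `regex` has changed over time.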


Segment Pizza Variant 


In [7]: df.variant.unique()

Out[7]: array(['double_signature', 'american_favorite', 'super_supreme',
       'meat_lovers', 'double_mix', 'classic', 'crunchy', 'new_york',
       'double_decker', 'spicy_tuna', 'BBQ_meat_fiesta', 'BBQ_sausage',
       'extravaganza', 'meat_eater', 'gourmet_greek', 'italian_veggie',
       'thai_veggie', 'american_classic', 'neptune_tuna', 'spicy tuna'],
      dtype=object)
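In [8]: # define function to segment pizza variants into types
def segment_variant(variant):
    # NOTE: the first branch is missing from this copy of the notebook;
    # matching on 'veggie' is an assumption that yields the 'Vegetarian' label seen later
    if 'veggie' in variant:
        return "Vegetarian"

    elif 'meat' in variant or 'BBQ' in variant:
        return "Meat"

    elif 'tuna' in variant:
        return "Seafood"

    else:
        return "Other"


# apply function to the 'variant' column, replacing each variant name with its type
df['variant'] = df['variant'].apply(segment_variant)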
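To make the segmentation rule concrete, here is a standalone sketch applied to a few variant names from the list above. The `'veggie'` condition is an assumption, because that branch is not legible in the notebook; the other branches follow the cell as shown.

```python
def segment_variant(variant: str) -> str:
    """Map a raw variant name to a coarse pizza type."""
    # Assumption: the notebook's first branch is not legible; 'veggie' is a guess.
    if "veggie" in variant:
        return "Vegetarian"
    elif "meat" in variant or "BBQ" in variant:
        return "Meat"
    elif "tuna" in variant:
        return "Seafood"
    else:
        return "Other"

samples = ["thai_veggie", "meat_lovers", "BBQ_sausage", "neptune_tuna", "classic"]
segments = {name: segment_variant(name) for name in samples}
print(segments)
```

The substring checks are case-sensitive and evaluated in order, so a name that matched two rules would take the label of the first matching branch.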

In [9]: plt.figure(figsize=(10,5))
df['variant'].value_counts().plot(kind='bar')

Out[9]: <AxesSubplot:>
Exploratory Data Analysis


localhost:8892/notebooks/Pizza Price Prediction.ipynb


4/24/23, 12:31 AM Pizza Price Prediction - Jupyter Notebook


In [11]: # List of categorical variables to plot
cat_vars = ['company', 'topping', 'variant', 'size', 'extra_sauce', 'extra_cheese']

# create figure with subplots
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(20, 10))
axs = axs.ravel()

# create barplot for each categorical variable
for i, var in enumerate(cat_vars):
    sns.barplot(x=var, y='price_rupiah', data=df, ax=axs[i], estimator=np.mean)
    axs[i].set_xticklabels(axs[i].get_xticklabels(), rotation=90)

# adjust spacing between subplots
fig.tight_layout()

# show plot
plt.show()


[Figure: 2 x 3 grid of bar plots showing mean price_rupiah for each categorical variable]

In [12]: sns.boxplot(x="diameter", data=df)

Out[12]: <AxesSubplot:xlabel='diameter'>

[Figure: box plot of diameter]

In [13]: sns.violinplot(x='diameter', data=df)

Out[13]: <AxesSubplot:xlabel='diameter'>

[Figure: violin plot of diameter]

In [14]: sns.scatterplot(data=df, x="diameter", y="price_rupiah", hue="company")

Data Preprocessing Part 2


In [15]: df.head()

Out[15]: [Table: first five rows of df. Columns: company, price_rupiah, diameter, topping, variant, size, extra_sauce, extra_cheese.]

In [16]: # Check missing values (percentage of missing entries per column)
check_missing = df.isnull().sum() * 100 / df.shape[0]
check_missing[check_missing > 0].sort_values(ascending=False)

Out[16]: Series([], dtype: float64)
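The empty Series above means no column has missing values. For contrast, the same expression can be run on a small made-up frame that does contain gaps; the toy data below is purely illustrative and not part of the pizza dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps (illustrative values only).
toy = pd.DataFrame({
    "diameter": [22.0, np.nan, 16.0, 14.0],
    "topping": ["chicken", "tuna", None, None],
    "size": ["jumbo", "small", "medium", "large"],
})

# Percentage of missing entries per column, keeping only columns with gaps.
missing_pct = toy.isnull().sum() * 100 / toy.shape[0]
report = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(report)
```

Here `topping` is reported at 50% and `diameter` at 25%, while `size` is dropped from the report because it has no missing entries.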


Label Encoding for Object datatype 


In [17]: # Loop over each column in the DataFrame where dtype is 'object'
for col in df.select_dtypes(include=['object']).columns:

    # Print the column name and the unique values
    print(f"{col}: {df[col].unique()}")

company: ['A' 'B' 'C' 'D' 'E']
topping: ['chicken' 'papperoni' 'mushrooms' 'smoked_beef' 'mozzarella'
 'black_papper' 'tuna' 'meat' 'sausage' 'onion' 'vegetables' 'beef']
variant: ['Other' 'Meat' 'Seafood' 'Vegetarian']
size: ['jumbo' 'reguler' 'small' 'medium' 'large' 'XL']
extra_sauce: ['yes' 'no']
extra_cheese: ['yes' 'no']


In [18]: from sklearn import preprocessing

# Loop over each column in the DataFrame where dtype is 'object'
for col in df.select_dtypes(include=['object']).columns:

    # Initialize a LabelEncoder object
    label_encoder = preprocessing.LabelEncoder()

    # Fit the encoder to the unique values in the column
    label_encoder.fit(df[col].unique())

    # Transform the column using the encoder
    df[col] = label_encoder.transform(df[col])

    # Print the column name and the unique encoded values
    print(f"{col}: {df[col].unique()}")

company: [0 1 2 3 4]
topping: [ 2  7  5  9  4  1 10  3  8  6 11  0]
variant: [1 0 2 3]
size: [1 4 5 3 2 0]
extra_sauce: [1 0]
extra_cheese: [1 0]
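The integer codes printed above follow the sorted order of the labels, not any physical ordering. A minimal sketch with the six size labels shows how the codes arise and how a kept encoder can map them back; the variable names here are illustrative and not from the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# The six size labels, in the order they first appear in the data.
sizes = ["jumbo", "reguler", "small", "medium", "large", "XL"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

# classes_ is sorted as strings (uppercase sorts before lowercase),
# so 'XL' gets code 0 even though it is not the smallest pizza.
print(list(encoder.classes_))
print(codes.tolist())

# Keeping the fitted encoder allows decoding the integers later.
decoded = encoder.inverse_transform(codes).tolist()
print(decoded)
```

Because the loop in the notebook reuses one variable for every column, the fitted encoders are discarded after each iteration. Storing them in a dict keyed by column name would be one way to keep the decoding available.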


I will not remove the outliers because the dataset is very small.


In [19]: # Correlation Heatmap
plt.figure(figsize=(20, 16))
sns.heatmap(df.corr(), fmt='.2g', annot=True)

Out[19]: <AxesSubplot:>

[Figure: correlation heatmap of all columns]


Train Test Split


In [20]: X = df.drop('price_rupiah', axis=1)
y = df['price_rupiah']

In [21]: # test size 20% and train size 80%
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
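As a quick illustration of what the 80/20 split produces, the sketch below runs `train_test_split` on synthetic arrays. The array sizes are arbitrary assumptions chosen only so the resulting shapes are easy to read. Note that `accuracy_score`, imported in the cell above, is a classification metric and does not apply to this regression target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 7 features (sizes are arbitrary).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 7))
y_demo = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)

# A fixed random_state makes the split reproducible across runs.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```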


Decision Tree Regressor 

In [22]: from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_boston  # unused here; load_boston was removed in scikit-learn 1.2

# Create a DecisionTreeRegressor object
dtree = DecisionTreeRegressor()

# Define the hyperparameters to tune and their values
param_grid = {
    'max_depth': [2, 4, 6, 8],
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [1, 2, 3, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}

# Create a GridSearchCV object
grid_search = GridSearchCV(dtree, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the GridSearchCV object to the data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print(grid_search.best_params_)

{'max_depth': 8, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': ...

In [23]: from sklearn.tree import DecisionTreeRegressor
# min_samples_split is cut off in this copy; 2 (the default) is an assumption consistent
# with Out[23], which lists only non-default parameters
dtree = DecisionTreeRegressor(random_state=0, max_depth=8, max_features='auto', min_samples_leaf=1, min_samples_split=2)
dtree.fit(X_train, y_train)

Out[23]: DecisionTreeRegressor(max_depth=8, max_features='auto', random_state=0)
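The tune-then-refit pattern used here can be sketched on synthetic data as follows. This is a hedged illustration, not the notebook's exact setup: the data comes from `make_regression`, the grid is shortened, and `'auto'` is replaced by `None` because `max_features='auto'` is no longer accepted by tree regressors from scikit-learn 1.3 onward.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the pizza features (assumption).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=7, noise=10.0, random_state=0
)

param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],  # None = use all features
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already trained on all of
# the data passed to fit, using the best parameter combination.
best_tree = search.best_estimator_
print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MSE of the best setting
```

Using `best_estimator_` avoids retyping the winning parameters by hand, which is where copy errors tend to creep in.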


In [24]: from sklearn import metrics
from sklearn.metrics import mean_absolute_percentage_error
import math

y_pred = dtree.predict(X_test)
mae = metrics.mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
r2 = metrics.r2_score(y_test, y_pred)
rmse = math.sqrt(mse)

print('MAE is {}'.format(mae))
print('MAPE is {}'.format(mape))
print('MSE is {}'.format(mse))
print('R2 score is {}'.format(r2))
print('RMSE score is {}'.format(rmse))

MAE is 8896.153846153846
MAPE is 0.11478195348575036
MSE is 173730965.46310833
R2 score is 0.7989720567793099
RMSE score is 13180.704285549704
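To show how the five reported metrics relate to each other, here is a plain-Python sketch of the standard formulas on four made-up price pairs. `regression_report` is a hypothetical helper written for this illustration; the notebook itself uses the scikit-learn functions.

```python
import math

def regression_report(y_true, y_pred):
    """Compute MAE, MAPE, MSE, RMSE and R2 from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    # MAPE as a fraction (0.11 means about 11%), as scikit-learn reports it.
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot
    return {"MAE": mae, "MAPE": mape, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}

# Made-up true and predicted prices in rupiah (illustrative only).
y_true = [100000, 150000, 200000, 250000]
y_pred = [110000, 140000, 190000, 270000]
report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(name, value)
```

Read this way, a MAPE of about 0.115 corresponds to an average error of roughly 11.5% of the true price, and RMSE is in the same unit as the price because it is the square root of MSE.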


In [25]: imp_df = pd.DataFrame({
    "Feature Name": X_train.columns,
    "Importance": dtree.feature_importances_
})
fi = imp_df.sort_values(by="Importance", ascending=False)

fi2 = fi.head(10)
plt.figure(figsize=(10,8))
sns.barplot(data=fi2, x='Importance', y='Feature Name')
plt.title('Feature Importance Each Attributes (Decision Tree Regressor)', fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature Name', fontsize=16)
plt.show()

[Figure: horizontal bar chart titled "Feature Importance Each Attributes (Decision Tree Regressor)"; x-axis: Importance, y-axis: Feature Name]

In [26]: import shap
explainer = shap.TreeExplainer(dtree)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

[Figure: SHAP summary plot for the decision tree; features listed from top to bottom: diameter, size, topping, company, extra_sauce, extra_cheese, variant; x-axis: SHAP value (impact on model output)]

In [27]: explainer = shap.Explainer(dtree, X_test)
shap_values = explainer(X_test)
shap.plots.waterfall(shap_values[0])

Random Forest Regressor


In [28]: from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Create a Random Forest Regressor object
rf = RandomForestRegressor()

# Define the hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt']
}

# Create a GridSearchCV object
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='r2')

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters: ", grid_search.best_params_)

Best hyperparameters: {'max_depth': 9, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': ...


In [29]: from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0, max_depth=9, min_samples_split=2, min_samples_leaf=1,
                           max_features='auto')
rf.fit(X_train, y_train)

Out[29]: RandomForestRegressor(max_depth=9, random_state=0)


In [30]: from sklearn import metrics
from sklearn.metrics import mean_absolute_percentage_error
import math

y_pred = rf.predict(X_test)
mae = metrics.mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
r2 = metrics.r2_score(y_test, y_pred)
rmse = math.sqrt(mse)

print('MAE is {}'.format(mae))
print('MAPE is {}'.format(mape))
print('MSE is {}'.format(mse))
print('R2 score is {}'.format(r2))
print('RMSE score is {}'.format(rmse))

MAE is 10979.558705183706
MAPE is 0.16435802453302076
MSE is 174617535.78390014
R2 score is 0.7979461866494185
RMSE score is 13214.292859774985

In [31]: imp_df = pd.DataFrame({
    "Feature Name": X_train.columns,
    # use the fitted random forest's importances for this chart
    "Importance": rf.feature_importances_
})
fi = imp_df.sort_values(by="Importance", ascending=False)

fi2 = fi.head(10)
plt.figure(figsize=(10,8))
sns.barplot(data=fi2, x='Importance', y='Feature Name')
plt.title('Feature Importance Each Attributes (Random Forest Regressor)', fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature Name', fontsize=16)
plt.show()

[Figure: horizontal bar chart titled "Feature Importance Each Attributes (Random Forest Regressor)"; x-axis: Importance, y-axis: Feature Name]

In [32]: import shap
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

[Figure: SHAP summary plot for the random forest; x-axis: SHAP value (impact on model output)]

In [33]: explainer = shap.Explainer(rf, X_test, check_additivity=False)
shap_values = explainer(X_test, check_additivity=False)
shap.plots.waterfall(shap_values[0])

AdaBoost Regressor


In [34]: from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

# Define AdaBoostRegressor model
abr = AdaBoostRegressor()

# Define hyperparameters and possible values
params = {'n_estimators': [50, 100, 150],
          'learning_rate': [0.01, 0.1, 1, 10]}

# Perform GridSearchCV with 5-fold cross validation
grid_search = GridSearchCV(abr, param_grid=params, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Print best hyperparameters and corresponding score
print("Best hyperparameters: ", grid_search.best_params_)

Best hyperparameters: {'learning_rate': 1, 'n_estimators': 50}

In [35]: from sklearn.ensemble import AdaBoostRegressor
abr = AdaBoostRegressor(random_state=0, learning_rate=1, n_estimators=50)
abr.fit(X_train, y_train)

Out[35]: AdaBoostRegressor(learning_rate=1, random_state=0)


In [36]: from sklearn import metrics
from sklearn.metrics import mean_absolute_percentage_error
import math

y_pred = abr.predict(X_test)
mae = metrics.mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
r2 = metrics.r2_score(y_test, y_pred)
rmse = math.sqrt(mse)

print('MAE is {}'.format(mae))
print('MAPE is {}'.format(mape))
print('MSE is {}'.format(mse))
print('R2 score is {}'.format(r2))
print('RMSE score is {}'.format(rmse))

MAE is 11310.953520883801
MAPE is 0.18546632012903405
MSE is 213998142.11136267
R2 score is 0.752378015933912
RMSE score is 14628.675336863645
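The three sets of test metrics printed in this notebook can be placed side by side for comparison. The sketch below copies the MAE, MAPE, R2 and RMSE values from the outputs above into a small table; the table layout and the names `results` and `ranked` are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Test-set metrics as printed for each model in the notebook.
results = pd.DataFrame(
    {
        "MAE": [8896.153846153846, 10979.558705183706, 11310.953520883801],
        "MAPE": [0.11478195348575036, 0.16435802453302076, 0.18546632012903405],
        "R2": [0.7989720567793099, 0.7979461866494185, 0.752378015933912],
        "RMSE": [13180.704285549704, 13214.292859774985, 14628.675336863645],
    },
    index=["Decision Tree", "Random Forest", "AdaBoost"],
)

# Lower RMSE is better, so the first row after sorting is the best model.
ranked = results.sort_values("RMSE")
best_model = ranked.index[0]
print(ranked.round(4))
print("Lowest RMSE:", best_model)
```

On this single 80/20 split the decision tree and the random forest are nearly tied on R2 and RMSE, while AdaBoost trails on every metric. With a dataset this small, a different split could plausibly change the ordering of the top two.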

In [37]: imp_df = pd.DataFrame({
    "Feature Name": X_train.columns,
    "Importance": abr.feature_importances_
})
fi = imp_df.sort_values(by="Importance", ascending=False)

fi2 = fi.head(10)
plt.figure(figsize=(10,8))
sns.barplot(data=fi2, x='Importance', y='Feature Name')
plt.title('Feature Importance Each Attributes (AdaBoost Regressor)', fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature Name', fontsize=16)
plt.show()

[Figure: horizontal bar chart titled "Feature Importance Each Attributes (AdaBoost Regressor)"; x-axis: Importance, y-axis: Feature Name]